ABSTRACT

A discussion of criterion-referenced measures is
presented. Two characteristics define the criterion-referenced
measure: the presence of a performance criterion, and test items
keyed to a set of behavioral objectives. The performance criterion,
in an educational setting, is usually a relative standard of
performance. There are two ways of constructing items for a
criterion-referenced test: the item-form approach and the
specification of objectives. Item reliability can be assessed by
calculating the proportion of subjects whose item scores (pass or
fail) are the same on a posttest and a retest, or on a posttest and a
parallel form. A measure of score reliability can be obtained by
calculating the mean item reliability; it may also be assessed using
the concept of within-subject equivalence of total scores. Another
index that can be used to assess item and test quality combines the
concepts of reliability and validity. The most important part of a
criterion-referenced measure is the set of behavioral objectives the
measure is based on. These objectives set the stage for judging the
effectiveness of the teacher’s instruction, and evaluating the
student’s learning. (CK)

Measurement theory has traditionally concerned itself with the accurate
estimation and interpretation of an individual's score in relation to the scores
of other individuals. Measures yielding such scores have been known as
norm-referenced. In contrast to norm-referenced measures are criterion-referenced
measures that yield scores for which the interpretation is not dependent on their
position in relation to other scores. The interpretation is, however, dependent
on the specific content of the items in the measure and the degree to which the
individual has attained criterion performance. Two characteristics, then, define
a criterion-referenced measure: the presence of a performance criterion, and
test items keyed to a set of behavioral objectives.

The performance criterion, in an educational setting, is usually a relative
standard of performance. It is based on our expectations, and revised when the
expectations are unrealistic. Although a criterion-referenced measure could
be scored dichotomously, i.e., pass or fail, there is no reason why it cannot be
scored as a norm-referenced measure.

There are essentially two different approaches to the construction of items
for criterion-referenced measures (see Popham, 1970). The first of these
approaches uses an item form to generate a population of items, all of which
measure the same objective. The second approach is to generate the items
by whatever means are available and, on an empirical basis, to revise or
delete those items that do not perform as desired.

Regardless of the procedure used to construct criterion-referenced
measures, traditional methods of evaluating norm-referenced measures may
at times be inappropriate for criterion-referenced measures. Traditional
methods depend on variability, and criterion-referenced measures, in the
ideal case, yield score distributions with zero variance. Even in less than
ideal situations, criterion-referenced measures yield skewed distributions,
with numerous identical scores, thus vitiating the application of traditional
indices of item and test quality.

As mentioned earlier, there are two ways of constructing items for
a criterion-referenced test, and one’s choice of these methods is primarily
determined by the nature of the behavioral objectives. The item-form
approach works well in areas like mathematics where the objectives can be
very narrowly defined (e.g., Kriewall, 1969). In less structured content
areas, however, the specification of objectives in such detail may not be
feasible (e.g., Hills, 1970). In deference to the classroom teacher, it
may not be practical to ask for such specificity, for the pool of objectives
would be much too large to handle easily.

If a pool of items keyed to an objective is generated by whatever
means are available to the item writer, then item difficulty is an important
concept. Within a pool of items on a given objective, it is certainly
conceivable that the difficulty of some items may be more appropriate than
others, and that revisions or deletions may be advantageous. Such
information can be obtained from pretest and posttest difficulty values for the
items. Within each item pool, those items with difficulty values that are
perceptibly different from the remaining items in the pool would be suspect. By
using the remaining items in the pool as a control group, rival hypotheses such
as prior knowledge or faulty instruction can be eliminated as being the
determiners of such aberrant values.

In the optimal case, an item used in a criterion-referenced measure
would have a zero or chance-level difficulty value on a pretest and a 1.00
value on the posttest. For such an item, it would be clear that instruction
was needed, and that the instruction was effective. A high difficulty value on
the pretest would cause one to examine the item for specific determiners
or some other clues which pointed to the answer. In the absence of these,
one might concede that instruction on the topic would be wasteful. A low
difficulty value on the posttest would suggest that there were ambiguities
in the item, that distractors were more similar than the distinctions that
the student had been taught to make, or that there was a flaw in the
instruction. An index as simple as the difference between the two difficulty
values may be used as an item selection index for criterion-referenced test
items. From a pool of six items on each of ten objectives, this author (1970)
constructed two criterion-referenced tests using this difference index to
select items. For each objective, the two items with the larger values went in
the first form of the test and the two items with the lower values went in the
second form of the test. Marked differences in the quality of the tests were
apparent when these tests were administered to a new sample of students.
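
The selection procedure described above might be sketched as follows. This is
only an illustration under stated assumptions: the difficulty value is taken to
be the proportion of examinees passing an item (as the usage above implies),
and the function names, item labels, and data layout are hypothetical rather
than taken from the 1970 study.

```python
from typing import Dict, List, Tuple


def difficulty(responses: List[int]) -> float:
    """Difficulty value: proportion of examinees passing (1 = pass, 0 = fail)."""
    return sum(responses) / len(responses)


def difference_index(pre: List[int], post: List[int]) -> float:
    """Posttest difficulty value minus pretest difficulty value for one item."""
    return difficulty(post) - difficulty(pre)


def split_into_forms(
    indices: Dict[str, float], per_form: int = 2
) -> Tuple[List[str], List[str]]:
    """For one objective, send the items with the largest index values to the
    first form and the items with the smallest values to the second form."""
    ranked = sorted(indices, key=lambda item: indices[item], reverse=True)
    return ranked[:per_form], ranked[-per_form:]


# Hypothetical index values for a pool of six items on one objective.
pool = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3, "f": 0.2}
first_form, second_form = split_into_forms(pool)
```

With six items per objective and two items per form, the two middle-ranked
items are simply left out of both forms, as in the procedure described above.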

Two separate studies (Cox and Vargas, 1966; Popham, 1970) have
compared this difference index with the difference between the proportions of
the upper and lower 27 percent who passed the item on the posttest. The
findings indicate that the pretest-posttest difference index selects different
items than traditional item-analysis indices based on an item’s discriminating
ability on a posttest only.

In norm-referenced testing, item-total correlations are computed to
ask directly a question about the homogeneity of the items, and indirectly
a question about the validity of each item. In criterion-referenced testing,
item homogeneity is of primary concern when we are examining the items
written for a given objective.

If the criterion-referenced measure is constructed without the use
of the item form, item homogeneity and content validity can be assessed
through the pretest and posttest difficulty values. Once again, the pool of
items on a given objective is used as the control against which each item
is evaluated. For a given objective, similarly low difficulty values on the
pretest and similarly high difficulty values on the posttest imply that the
set of items is homogeneous.

Item homogeneity across objectives would be of concern if the
objectives, for some reason, could be considered dependent on each other.
The logic underlying such a dependency would necessitate the homogeneity
of the items. A lack of homogeneity would point to one of two possible
conclusions. Either the items were not adequately reflecting the objectives,
or the objectives were sufficiently independent of each other to vitiate the
assumed dependency.

Item reliability can be assessed by calculating the proportion of
subjects whose item scores (pass or fail) are the same on a posttest and a
retest, or on a posttest and a parallel form. In the first instance, the index
is a measure of item stability, and in the second, the index is a measure of
item equivalence. In both cases, however, the maximum value of one would
reflect perfect agreement across all subjects.

A measure of score reliability can be obtained by calculating the 
mean item reliability. An advantage to this method of calculating score 
reliability is that one is able to identify the particular items that are causing 
an undesirably low score reliability, thus allowing one to delete or revise 
those items.
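
As a minimal sketch of these two calculations, assuming pass/fail scores are
coded 1 and 0 and stored per item in the same subject order for both
administrations (the function names and data layout are hypothetical):

```python
from typing import Dict, List, Tuple


def item_reliability(first: List[int], second: List[int]) -> float:
    """Proportion of subjects whose pass/fail score on an item is the same
    on the two administrations (posttest and retest, or parallel form)."""
    agreements = sum(1 for a, b in zip(first, second) if a == b)
    return agreements / len(first)


def score_reliability(
    first_admin: Dict[str, List[int]], second_admin: Dict[str, List[int]]
) -> Tuple[float, Dict[str, float]]:
    """Mean item reliability, plus the per-item values so that items
    responsible for a low mean can be identified."""
    per_item = {
        item: item_reliability(first_admin[item], second_admin[item])
        for item in first_admin
    }
    mean = sum(per_item.values()) / len(per_item)
    return mean, per_item


# Hypothetical data: two items, four subjects.
posttest = {"item1": [1, 1, 0, 1], "item2": [1, 0, 1, 1]}
retest = {"item1": [1, 1, 0, 1], "item2": [0, 1, 1, 1]}
mean_reliability, by_item = score_reliability(posttest, retest)
```

Returning the per-item values alongside the mean reflects the advantage noted
above: the items that pull the score reliability down can be inspected
directly.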

Score reliability may also be assessed using the concept of
within-subject equivalence of total scores. For each subject, the raw scores
from two test administrations, either test-retest or parallel forms, would be
converted into percent-correct scores. For each examinee, the absolute
difference between the percent correct on the two administrations would be
obtained. It is hoped, of course, that these percent-difference scores
would be small, an indication of high reliability. The actual reliability
index would consist of reporting the percent of subjects with
percent-difference scores of a given size or less, e.g., a difference of 5
percentage points or less. To compare reliabilities across tests, one might
report two kinds of information for each method: the percent of scores agreeing
within say 5 percent, and the percentage interval within which say 90 percent
of the scores agree. For example, it may be reported that for a given test,
stability is reflected in that for 84 percent of the examinees, scores upon
retesting after one week with no intervening instruction agree with scores
on the earlier test within 5 percent, and that for 90 percent of the examinees,
the retest score is within 8 percent of the score attained by that examinee on
the earlier test.
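
The two reported quantities might be computed along the following lines. This
is a sketch only: the function names, the example scores, and the rule used to
pick the interval (the smallest observed difference that covers the requested
share of examinees) are assumptions made for illustration.

```python
from typing import List


def percent_differences(
    first: List[int], second: List[int], max_score: int
) -> List[float]:
    """Absolute difference in percent-correct scores for each examinee."""
    return [abs(a - b) * 100 / max_score for a, b in zip(first, second)]


def percent_agreeing_within(diffs: List[float], tolerance: float) -> float:
    """Percent of examinees whose two scores agree within `tolerance` points."""
    within = sum(1 for d in diffs if d <= tolerance)
    return 100 * within / len(diffs)


def interval_covering(diffs: List[float], coverage: int = 90) -> float:
    """Smallest observed difference such that at least `coverage` percent
    of examinees have a difference that size or smaller."""
    ordered = sorted(diffs)
    needed = (coverage * len(ordered) + 99) // 100  # integer ceiling
    return ordered[needed - 1]


# Hypothetical raw scores for ten examinees on a 20-point test.
test_scores = [18, 16, 20, 14, 12, 19, 17, 15, 20, 10]
retest_scores = [18, 17, 20, 14, 13, 19, 15, 15, 20, 14]
diffs = percent_differences(test_scores, retest_scores, max_score=20)
within_5 = percent_agreeing_within(diffs, tolerance=5)
interval_90 = interval_covering(diffs, coverage=90)
```

For these invented scores, 80 percent of examinees agree within 5 percentage
points, and 90 percent agree within 10.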



The discussion so far has been concerned with possible analogues
to the traditional concepts of item difficulty, item selection, and reliability.
Another index that can be used to assess item and test quality combines the
concepts of reliability and validity. This index requires three administrations
of the same test to the same subjects: once as a pretest, once as a posttest,
and once as a retest. If the test is functioning as expected, scores would be
near the chance level on the pretest, and near mastery on the posttest and the
retest. Thus, for each item and for each subject, we would expect a maximum
change in performance from pretest to posttest, and a minimum change from
posttest to retest.

The index consists of calculating for each item the value of the
expression

$$(p_{post} - p_{pre})\,(1 - |p_{retest} - p_{post}|)$$

where p represents the proportion of subjects passing the given item on the
particular administration. This index can range from a maximum value of one to a
minimum value of minus one. Values less than zero can only be attained if the
proportion of subjects passing the item on the pretest is greater than the
proportion passing on the posttest, clearly an undesirable occurrence in
criterion-referenced testing.

As stated earlier, this index is a combination reflecting both reliability
and validity. The first term in the expression is an index of validity in that it
reflects performance between the pretest and the posttest. The second term
reflects reliability (stability) in that it reflects performance from the posttest
to the retest.

Although the previous discussion was concerned with the use of this 
index to assess item quality, it can be used to assess overall test quality. The 
test index is obtained by averaging the index values across all items of the test. 
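
A brief sketch of the item index and the test index, assuming the three
proportions passing are already available for each item (the function names
and example values are hypothetical):

```python
from typing import List, Tuple


def combined_index(p_pre: float, p_post: float, p_retest: float) -> float:
    """(p_post - p_pre) * (1 - |p_retest - p_post|) for a single item.

    The first factor reflects the pretest-to-posttest gain (validity); the
    second shrinks the index when posttest and retest disagree (stability).
    """
    return (p_post - p_pre) * (1 - abs(p_retest - p_post))


def mean_index(items: List[Tuple[float, float, float]]) -> float:
    """Test index: the item index averaged across all items of the test."""
    values = [combined_index(pre, post, retest) for pre, post, retest in items]
    return sum(values) / len(values)


# Hypothetical (pretest, posttest, retest) proportions for two items.
ideal_item = (0.0, 1.0, 1.0)      # no one passes before, everyone after
weaker_item = (0.25, 0.75, 0.5)   # smaller gain, some loss at retest
test_value = mean_index([ideal_item, weaker_item])
```

The ideal item reaches the maximum of one; the weaker item yields 0.375, since
both a smaller gain and a posttest-retest discrepancy reduce the index.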

A similar measure of instructional effectiveness based on the ratio of
actual gain to maximum possible gain from pretest to posttest has been suggested
(see McGuigan and Peters, 1965; Brennan, 1970). Although this index may be
useful, it appears to suffer from a lack of a theoretical basis for judging test
effectiveness. This can be illustrated by the following hypothetical example.
Assume that two tests, A and B, with maximum possible scores of 20 were
administered as pretests and posttests to the same subjects. The pretest and
posttest means for test A were 4 and 12, respectively, and the corresponding
values for B were 12 and 16. This yields index values of .50 for both tests
because test A showed a gain of 8 out of 16 possible points and test B showed
a gain of 4 out of 8 possible points. Although the index values indicate the
two tests were equally effective, there appears to be no rationale for such a
decision. Further investigation is needed to determine what magnitude of
gains in what part of the score scale constitute equal effectiveness.
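
The arithmetic of the hypothetical example can be checked with a short sketch.
The function name is an assumption; the formula is simply the ratio of actual
gain to maximum possible gain as described above.

```python
def gain_ratio(pre_mean: float, post_mean: float, max_score: float) -> float:
    """Actual gain divided by the maximum possible gain from the pretest."""
    return (post_mean - pre_mean) / (max_score - pre_mean)


# Tests A and B from the hypothetical example, each scored out of 20.
ratio_a = gain_ratio(pre_mean=4, post_mean=12, max_score=20)   # 8 of 16 points
ratio_b = gain_ratio(pre_mean=12, post_mean=16, max_score=20)  # 4 of 8 points
```

Both ratios come out to .50 even though test A gained twice as many raw points
as test B, which is the difficulty the example is meant to expose.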

The most important part of a criterion-referenced measure is the set 
of behavioral objectives the measure is based on. These objectives set the 
stage for judging the effectiveness of the teacher’s instruction, and evaluating
the student’s learning. Without carefully written objectives, the task of con- 
structing a criterion-referenced test is self-defeating. Although the ideas 
presented in this paper may serve as an aid in assessing item and test quality 
for criterion-referenced measures, they cannot replace the creative artistry 
of the item writer. 